Introduction and data info
This document shows a worked example of analyzing 24-hour
accelerometry data, using a subset of the open-access Long-term Movement
Monitoring Database. I’ve already summarized the raw 100 Hz
acceleration data and summarized it in one-minute epochs as “MIMS units”
via the MIMSunit R package (link here). MIMS
units are very similar to “counts” available from Actigraph devices, but
they have some technical and practical advantages. Check out the
paper for more on MIMS units.
The code and data accompanying this demo can be found at GitHub
here.
I made this demo to accompany my talk for the 2023 Mathematical
Sciences in Obesity Research short course. This is an
educational analysis meant to be quick and feasible on a
lower-end laptop, which put some hard constraints on how much and how
complex of a dataset we can work with. It also skips over some of the
finer points that really do matter in research studies, like what to do
about non-wear-time and other sources of missingness in your data. Also,
in part because of the small size of the dataset, we don’t get
particularly interesting or insightful results. Repeat this analysis on
the entire NHANES cohort and it’d probably be a different story.
The present dataset consists of waist-worn accelerometer data from
one 24-hour day (midnight to midnight) from 50 older adults. These older
adults were classified as “fallers” (people who had a history of falls)
and “controls” (no history of falls). This is a subset of the larger
LTMMD; I preprocessed only subjects with a reasonable amount of wear
time in their first 24-hour day of device wear. The full dataset is
larger (~80 subjects) and has multiple days of data from each
subject.
Main goals
This educational demo is intended to showcase a few aspects of
analyzing 24-hour accelerometer data:
- The right-skewed distribution typically seen in both aggregate and
minute-by-minute accelerometer data
- Potential strategies for modeling continuous and categorical
predictors of accelerometer-measured physical activity
- How to include smooth, nonlinear predictors in models of activity
data
- A functional data analysis (FDA) approach to modeling 24-hour
activity
Read in data
As noted above, the data are already preprocessed into one-minute
epochs. This substantially reduces the size of the data we need to
download. Notice how for each subject we’re calculating a
total_MIMS variable that is just a sum of all of their
accumulated MIMS units throughout the day.
library(tidyverse)
library(viridis) #pretty colors
library(mgcv) #for semiparametric regression
library(refund) #for functional regression
library(reshape2) #just for melt
library(gratia) #Viewing mgcv model results
# --- Load in our files and main dataframe ----
main_df <- read_csv("data/main_df.csv", show_col_types = FALSE)
all_files <- list.files("data/day_mims/", pattern="*.csv")
#Preallocate
df_list <- list()
day_list <- list()
subject_df_list <- list()
Y <- matrix(data=NA,nrow=length(all_files),ncol=1440) #1440 minutes in a day
for (i in 1:length(all_files)){
f <- all_files[i]
this_subject <- gsub("\\_mims.csv$", "", f)
this_df <- read_csv(paste("data/day_mims/", f, sep=''), show_col_types = FALSE) %>%
mutate(hours = seq(0,24,length.out=1440),
sub_code = this_subject)
#Total activity
this_subject_df <- main_df %>% filter(sub_code == this_subject) %>%
mutate(total_mims = sum(this_df$MIMS_UNIT))
day_list[[i]] <- this_df
Y[i,] <- this_df$MIMS_UNIT #Drop in 24hr activity data
df_list[[i]] <- this_subject_df #Slightly untidy but works
}
scalar_df <- bind_rows(df_list)
day_df <- bind_rows(day_list)
Examine one day
Here’s a pretty typical day from the dataset:
ix <- 2 #Which subject?
hour_seq <- seq(0, 24, by = 3)
formatted_hours <- sprintf("%02d:00", hour_seq)
ggplot(mapping = aes(x=seq(0,24,length.out=1440),
y = Y[ix,])) +
geom_ribbon(aes(ymax = Y[ix,], ymin=0), fill = "navy", alpha = 0.4) +
geom_line(linewidth = 0.5, color = "navy") +
ggtitle("24hrs of activity data, 1min summary") +
scale_x_continuous(limits=c(0,24), breaks = hour_seq, expand = c(0.01,0),
labels = formatted_hours,
name = 'Time of day') +
scale_y_continuous(name = 'Activity level (MIMS units)') +
theme(plot.title = element_text(hjust=0.5))

And here’s data from several subjects:
plt_subs <- day_df$sub_code %>% unique() %>% head(9)
day_df %>%
filter(sub_code %in% plt_subs) %>%
ggplot(aes(x=hours, y=MIMS_UNIT, group = sub_code)) +
geom_ribbon(aes(ymax = MIMS_UNIT, ymin=0), fill = "navy", alpha = 0.4) +
geom_line(linewidth = 0.25, color="navy") +
facet_wrap(~sub_code, ncol=3)

It’s also worth looking at the histogram of MIMS units throughout the
day:
ggplot(mapping = aes(x=Y[ix,])) +
geom_histogram(aes(y = ..density..), colour="black", fill="navy",
alpha = 0.2, bins = 25) +
geom_density(color = "navy", size=1) +
ggtitle("Distribution of activity level during the day") +
scale_x_continuous(name = "Activity level (MIMS units)")
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning: The dot-dot notation (`..density..`) was deprecated in ggplot2 3.4.0.
## ℹ Please use `after_stat(density)` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

It’s very right-skewed! This can be modestly improved with a
log(x+1) transform (+1 because there are zero values).
ggplot(mapping = aes(x=log(Y[ix,] + 1))) +
geom_histogram(aes(y = ..density..), colour="black", fill="navy",
alpha = 0.2, bins = 25) +
geom_density(color = "navy", size=1) +
ggtitle("Distribution of activity level during the day") +
scale_x_continuous(name = "Log(activity + 1)")

Aggregating daily activity
A simple way to summarize someone’s activity level is to just add up
their activity across the day, as we did above. Here’s the distribution
of total MIMS units, at the subject level:
scalar_df %>%
ggplot(aes(x=total_mims)) +
geom_histogram(aes(y = ..density..), colour="black", fill="navy",
alpha = 0.2, bins = 15) +
geom_density(color = "navy", size=1) +
ggtitle("Sum of daily activity (N=50 subjects)") +
scale_x_continuous(name = "Total MIMS units")

Again, the situation improves a bit with a log transform (no plus one
here since there are no zeros).
scalar_df %>%
ggplot(aes(x=log(total_mims))) +
geom_histogram(aes(y = ..density..), colour="black", fill="navy",
alpha = 0.2, bins = 15) +
geom_density(color = "navy", size=1) +
ggtitle("Sum of daily activity (N=50 subjects)") +
scale_x_continuous(name = "log(Total MIMS units)")

From a modeling situation either modeling total_mims or
log(total_mims) is likely justifiable. Here’s a simple
model testing to see if there is a (smooth, possibly nonlinear)
association between age and total activity level.
#k is number of knots for the spline - with REML penalization, we just need "enough"
# - too many is not really a big concern
k <- 7 # an approximate rule for k is sqrt() of number of unique values, here 49
age_model <- gam(total_mims ~ s(age, bs="cr", k=k),
method = "REML",
data = scalar_df)
summary(age_model)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## total_mims ~ s(age, bs = "cr", k = k)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2840.6 152.7 18.6 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(age) 1 1 4.067 0.0493 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.0589 Deviance explained = 7.81%
## -REML = 405.99 Scale est. = 1.1664e+06 n = 50
gratia::draw(age_model)

So, we have weak evidence that people get less active with age (not
surprising). In this case it does seem like a linear fit would be
sufficient, as evidenced in part by the edf of 1 for the
smooth term for age. With larger datasets this will not generally be the
case, though. We could’ve fit to log(total_mims); it
improves the R^2 value a bit, but doesn’t make a huge difference.
In any case, here’s a linear fit to the data:
scalar_df %>%
ggplot(aes(x=age, y=total_mims)) +
geom_point() +
geom_smooth(method="lm")
## `geom_smooth()` using formula = 'y ~ x'

Analyzing the 24-hour activity cycle
A more sophisticated analysis might involve looking at how 24-hour
activity changes with age. This, too, can be modeled; we need a
functional data model to do so. Below, I use the regression framework
introduced by Sonja Greven and Fabian Scheipl in the refund
package, which extends the scalar GAM from above to work with functional
data. Here, our “function” of interest is the 24-hour activity curve
that we plotted earlier. Each subject’s 24-hour day is one observation,
and we can study how that observation changes as a function of
covariates like age or whether this subject has a history of falls.
Greven and Schiepl’s framework is very clever: it reframes functional
regression as scalar regression, with the residuals being i.i.d.
conditional on the model. That just means you need to model all of the
sources of non-iid variation in the data, and your model will be
correctly specified. The biggest change from the scalar model is the
introduction of smooth residuals, \(E_i(t)\), which are modeled as random
functional effects.
The interface is pretty similar to MGCV. Here’s code that fits a
functional version of the same model from above, to study how the 24hr
activity changes as a function of age.
For simplicity, we consider only a constant linear effect of age
(c(age)), though remember on the log scale this implies
multiplicative effects.
#Warning - takes a while to run!
#for now
scalar_df$age5_centered <- (scalar_df$age - 75)/5
#Standardize to 75 yr old, 5yr change = one unit change for \beta (t)
scalar_df$sub_code <- factor(scalar_df$sub_code) #mgcv requires factors for bs="re"
Y_lp1 <- log(Y + 1) # Log(x+1) transform, better for Gaussian link function
y_ix <- seq(0,24, length.out = 1440) #Time index, hours (0-24)
k_int <- 24 #knots for global intercept - try 16-24
k_beta <- 64 #knots for functional effect including random effects - try 48-64
ctrl <- mgcv::gam.control(trace = TRUE) # verbose=1
mod_fx <- pffr(Y_lp1 ~ c(age5_centered) + s(sub_code, bs="re"),
data = scalar_df,
yind = y_ix, #Vector of time index, from 0 to 24 hours
bs.yindex = list(bs="cp", k=k_beta, m=c(2,1)), #k cubic P-splines
bs.int = list(bs = "cp", k=k_int, m=c(2,1)),
algorithm = "bam",
method = "fREML", #fast REML approximation
control = ctrl,
discrete = TRUE) #use discrete approximation for speed
## Setup complete. Calling fit
## Fit complete. Finishing gam object.
## user.self sys.self elapsed
## initial 0.19 0.02 0.21
## gam.setup 20.29 0.17 20.48
## pre-fit 0.00 0.00 0.00
## fit 1041.77 8.19 1050.78
## finalise 0.01 0.00 0.02
summary(mod_fx, re.test = FALSE) #re.test slows down summary, not needed here
##
## Family: gaussian
## Link function: identity
##
## Formula:
## Y_lp1 ~ c(age5_centered) + s(sub_code, bs = "re")
##
## Constant coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.61230 0.01916 31.958 < 2e-16 ***
## age5_centered -0.04612 0.01657 -2.783 0.00538 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Smooth terms & functional coefficients:
## edf Ref.df F p-value
##
## R-sq.(adj) = 0.63 Deviance explained = 64.6%
## fREML score = 63022 Scale est. = 0.29327 n = 72000 (50 x 1440)
plot(mod_fx, pages=1)

#Hack way to plot results
pred_df <- predict(mod_fx) %>% as.data.frame() %>%
pivot_longer(everything(), names_prefix = "V", values_to = "yhat",
names_to = "sample") %>%
mutate(sample = as.numeric(sample),
hours = (sample-1)/1439*24) %>% #care, off by one
mutate(yhat_raw = exp(yhat) - 1) %>%
mutate(sub_code = rep(scalar_df$sub_code, each = 1440))
plot_pred_df <- pred_df %>%
filter(sub_code %in% plt_subs)
day_df %>%
filter(sub_code %in% plt_subs) %>%
ggplot(aes(x=hours, y=MIMS_UNIT, group = sub_code)) +
geom_ribbon(aes(ymax = MIMS_UNIT, ymin=0), fill = "navy", alpha = 0.2) +
geom_line(linewidth = 0.25, color="navy", alpha = 0.4) +
geom_line(aes(y=yhat_raw), data = plot_pred_df,
color = "red", linewidth = 1, alpha = 0.6) +
facet_wrap(~sub_code) +
scale_x_continuous(limits=c(0,24), breaks = hour_seq, expand = c(0.02,0))

View on linear predictor scale (log(x+1))
day_df %>%
mutate(log_p1_mims = log(MIMS_UNIT + 1)) %>%
filter(sub_code %in% plt_subs) %>%
ggplot(aes(x=hours, y=log_p1_mims, group = sub_code)) +
geom_ribbon(aes(ymax = log_p1_mims, ymin=0), fill = "navy", alpha = 0.2) +
geom_line(linewidth = 0.25, color="navy", alpha = 0.4) +
geom_line(aes(y=yhat), data = plot_pred_df,
color = "red", linewidth = 1, alpha = 0.6) +
facet_wrap(~sub_code) +
scale_x_continuous(limits=c(0,24), breaks = hour_seq, expand = c(0.02,0))

Viewing functional model:
#Hacking the underlying GAM to get raw predictions
#y_ix.vec name is a consequence of passing variable y_ix earlier.
#If we used t, it would be called t.vec
int_df <- data.frame(y_ix.vec = seq(0,24,length.out = 1440)) %>%
mutate(sub_code = factor("CO002"),
age5_centered = 1)
#Will ignore subject when constructing intercept, needed to dodge predict warning
#iterms to include intercept uncertainty
beta_preds <- mgcv::predict.gam(mod_fx, type="iterms",
newdata = int_df, se.fit = TRUE)
#Then do response for yhat on lp1 scale
smooth_select <- paste("s(y_ix.vec):", "c(age5_centered)", sep="")
smooth_select <- "age5_centered" #if c(age5_centered)
y_constant <- attr(beta_preds, "constant")
y_functional <- beta_preds$fit[,"s(y_ix.vec)"]
y_functional_se <- beta_preds$se.fit[,"s(y_ix.vec)"]
beta_t <- beta_preds$fit[,smooth_select]
#LP - linear predictor + 95% CIs
y_LP <- y_constant + y_functional
y_LP_low <- y_constant + y_functional - 1.96*y_functional_se
y_LP_hi <- y_constant + y_functional + 1.96*y_functional_se
#Also hack-ish, would be better with tidyfun:: methods but would introduce more dependencies
beta_df <- data.frame(hours = seq(0,24,length.out=1440),
y_LP_65 = y_constant + y_functional + -2*beta_t,
y_LP_70 = y_constant + y_functional + -1*beta_t,
y_LP_75 = y_constant + y_functional + 0*beta_t,
y_LP_80 = y_constant + y_functional + 1*beta_t,
y_LP_85 = y_constant + y_functional + 2*beta_t,
y_LP = y_constant + y_functional,
y_LP_low = y_constant + y_functional - 1.96*y_functional_se,
y_LP_hi = y_constant + y_functional + 1.96*y_functional_se) %>%
mutate(yhat_raw = exp(y_LP) - 1,
yhat_raw_low = exp(y_LP_low) - 1,
yhat_raw_hi = exp(y_LP_hi) - 1
) %>%
mutate(y_65 = exp(y_LP_65) - 1,
y_70 = exp(y_LP_70) - 1,
y_75 = exp(y_LP_75) - 1,
y_80 = exp(y_LP_80) - 1,
y_85 = exp(y_LP_85) - 1)
beta_df %>%
ggplot(aes(x=hours, y=yhat_raw)) +
geom_line(color = "navy") +
geom_ribbon(aes(ymin = yhat_raw_low, ymax = yhat_raw_hi),
fill = "navy", alpha = 0.2) +
scale_x_continuous(limits=c(0,24), breaks = hour_seq, expand = c(0.01,0),
labels = formatted_hours,
name = 'Time of day') +
ggtitle("Estimated average daily activity pattern for a 75 yr old")

Plotting age estimates:
lwd <- 2
age_col <-viridis(5, direction = 1) #5 colors for 65-85
beta_df %>%
ggplot(aes(x=hours, y=yhat_raw)) +
geom_line(aes(y=y_65), color = age_col[1], linewidth = lwd) +
geom_line(aes(y=y_70), color = age_col[2], linewidth = lwd) +
geom_line(aes(y=y_75), color = age_col[3], linewidth = lwd) +
geom_line(aes(y=y_80), color = age_col[4], linewidth = lwd) +
geom_line(aes(y=y_85), color = age_col[5], linewidth = lwd) +
geom_point(aes(color = hours), alpha = 0) +
scale_color_viridis_c(option="viridis", limits = c(65,85),
breaks = c(65, 70, 75, 80, 85),
labels = c('65', '70', '75', '80', '85')) +
guides(color = guide_colorbar(title = "Age (yrs)")) +
scale_x_continuous(limits=c(0,24), breaks = hour_seq, expand = c(0.01,0),
labels = formatted_hours,
name = 'Time of day') +
ggtitle("Estimated effect of age on 24-hr activity") +
theme(legend.position = "bottom")

Avenues for improvement
- Using repeated days within subjects (raw dataset has 2-3 days per
person)
- Intelligently dealing with missing data, either through sparse FDA
or other methods
- Trying a poisson, beta, or other extended-family regression model to
deal with zero-inflation
- Using principal component smooth residuals instead of P-spline basis
residuals (
pcre() terms in pffr)
Some residual diagnosics
We can check how “good” our iid assumption is by plotting the
residual correlation matrix. A perfectly-specified pffr
model should be an identity matrix (all off-diagonal correlations =
0)
I suspect the worse-ish autocorrelation in the upper left (remember,
that’s midnight to ~5am) is from some, but not all, subjects taking the
device off at night. As above, that might be addressed by a cleverer
transformation than log(x+1).
resid_E = residuals(mod_fx)
R_E <- cor(resid_E)
melted_R <- melt(R_E)
ggplot(data = melted_R, aes(x=Var2, y=Var1, fill=value)) +
geom_raster() +
scale_x_continuous(expand = c(0,0)) +
scale_y_reverse(expand = c(0,0)) +
scale_fill_gradientn(colors = c("#e41a1c", "white", "#377eb8"),
limits = c(-1, 1),
breaks = c(-1, 0, 1)) +
ggtitle('Residual autocorrelation in original model')

Fit an intentionally-misspecified model and compare
If we skip the subject/curve-specific random effects (one in the same
since we have one day from each subject) we misspecify the model and
should have a lot more residual autocorrelation. Do we?
mod_mis <- pffr(Y_lp1 ~ c(age5_centered),
data = scalar_df,
yind = y_ix, #Vector of time index, from 0 to 24 hours
bs.yindex = list(bs="cp", k=k_beta, m=c(2,1)), #k cubic P-splines
bs.int = list(bs = "cp", k=k_int, m=c(2,1)),
algorithm = "bam",
method = "fREML", #fast REML approximation
control = ctrl,
discrete = TRUE) #use discrete approximation for speed
## Setup complete. Calling fit
## Fit complete. Finishing gam object.
## user.self sys.self elapsed
## initial 0.03 0 0.03
## gam.setup 0.00 0 0.00
## pre-fit 0.00 0 0.00
## fit 0.02 0 0.01
## finalise 0.01 0 0.02
#summary(mod_mis, re.test = FALSE)
#plot(mod_mis, pages=1)
resid_M = residuals(mod_mis)
R_M <- cor(resid_M)
melted_RM <- melt(R_M)
ggplot(data = melted_RM, aes(x=Var2, y=Var1, fill=value)) +
geom_raster() +
scale_x_continuous(expand = c(0,0)) +
scale_y_reverse(expand = c(0,0)) +
scale_fill_gradientn(colors = c("#e41a1c", "white", "#377eb8"),
limits = c(-1, 1),
breaks = c(-1, 0, 1)) +
ggtitle('Residual autocorrelation in misspecified model')

Much worse!